(57) ABSTRACT

A computer system includes multiple processors, each of 
which includes an associated memory. Each of the proces- 
sors is capable of accessing the memory of all other pro- 
cessors. Memory can be stored and accessed using different 
addressing schemes. For data that will only be used by the 
local processor, data is stored in memory using processor 
contiguous addressing, so that data is stored in the local 
memory. For data that may be accessed by multiple 
processors, data is stored using striping among a local 
processor set. A stripe control register in the memory con- 
troller of each memory comprises a mask that indicates 
which memory blocks should be accessed using processor 
contiguous addressing and which should be accessed by 
using striped addressing. For both striped and contiguous 
addressing, the address space includes a processor identifi- 
cation field to identify the processor where the associated 
memory resides, together with an offset indicating where in
memory the address is located. The processor identification 
field for striped addressing includes two bits located in low 
order address space identifying a four processor local stripe 
set. The other processor identification bits define which four 
processors comprise each stripe set. 
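
As a minimal sketch of the two addressing modes, the following Python decodes only the fields that the summary of the invention places at explicit bit positions: bits 0-5 for cache alignment, bit 6 for the memory port, bit 36 for the stripe bit, and the two low-order processor identification bits at bits 34-35 (unstriped) or bits 7-8 (striped). The function and field names are hypothetical, and the sketch assumes that a stripe bit of 1 selects striped addressing, which the text does not state. The remaining offset and processor bits are left undecoded.

```python
# Illustrative sketch only. Names are hypothetical, and the polarity of the
# stripe bit (1 = striped) is an assumption not stated in the text.

def bits(addr: int, lo: int, hi: int) -> int:
    """Return address bits lo..hi (inclusive) as an integer."""
    return (addr >> lo) & ((1 << (hi - lo + 1)) - 1)

def decode(addr: int) -> dict:
    """Extract the fields whose bit positions the summary gives explicitly."""
    striped = bits(addr, 36, 36) == 1
    # Low two processor ID bits: bits 7-8 when striped, bits 34-35 otherwise.
    low_proc = bits(addr, 7, 8) if striped else bits(addr, 34, 35)
    return {
        "byte_in_block": bits(addr, 0, 5),
        "port": bits(addr, 6, 6),
        "striped": striped,
        "low_processor_bits": low_proc,
    }

def stripe_walk(base: int, count: int) -> list:
    """Low processor bits seen when stepping through 128-byte units."""
    return [decode(base + i * 128)["low_processor_bits"] for i in range(count)]
```

Stepping a striped address in 128-byte increments (one 64-byte cache block on each of the two ports) cycles through the four members of a stripe set, while the same walk through an unstriped address stays on a single processor.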

29 Claims, 5 Drawing Sheets 
EFFICIENT ADDRESS INTERLEAVING
WITH SIMULTANEOUS MULTIPLE 
LOCALITY OPTIONS 

CROSS-REFERENCE TO RELATED 
APPLICATIONS 
filed Aug. 31, 2000, "Fault Containment And Error Recovery
Techniques In A Scalable Multiprocessor," Ser. No.
09/651,949, filed Aug. 31, 2000, "Speculative Directory 
Writes In A Directory Based Cache Coherent Nonuniform
Memory Access Protocol," Ser. No. 09/652,834, filed Aug. 
31, 2000, "Special Encoding Of Known Bad Data," Ser. No. 
09/652,314, filed Aug. 31, 2000, "Broadcast Invalidate
Scheme," Ser. No. 09/652,165, filed Aug. 31, 2000, "Mechanism
To Track All Open Pages In A DRAM Memory
System," Ser. No. 09/652,704, filed Aug. 31, 2000, "Pro- 
grammable DRAM Address Mapping Mechanism," Ser. No. 
09/653,093, filed Aug. 31, 2000, "Computer Architecture 
And System For Efficient Management Of Bi-Directional 
Bus," Ser. No. 09/652,323, filed Aug. 31, 2000, "A High
Performance Way Allocation Strategy For A Multi-Way 
Associative Cache System," Ser. No. 09/653,092, filed Aug. 
31, 2000, "Method And System For Absorbing Defects In 
High Performance Microprocessor With A Large N-Way Set 
Associative Cache," Ser. No. 09/651,948, filed Aug. 31, 
2000, "A Method For Reducing Directory Writes And 
Latency In A High Performance, Directory-Based, Coherency
Protocol," Ser. No. 09/652,324, filed Aug. 31, 2000,
"Mechanism To Reorder Memory Read And Write Trans- 
actions For Reduced Latency And Increased Bandwidth," 
Ser. No. 09/653,094, filed Aug. 31, 2000, "System For 
Minimizing Memory Bank Conflicts In A Computer 
System," Ser. No. 09/652,325, filed Aug. 31, 2000, "Com- 
puter Resource Management And Allocation System," Ser. 
No. 09/651,945, filed Aug. 31, 2000, "Input Data Recovery 
Scheme," Ser. No. 09/653,643, filed Aug. 31, 2000, "Fast 
Lane Prefetching," Ser. No. 09/652,451, filed Aug. 31, 2000,
"Mechanism For Synchronizing Multiple Skewed Source-Synchronous
Data Channels With Automatic Initialization
Feature," Ser. No. 09/652,480, filed Aug. 31, 2000, "Mechanism
To Control The Allocation Of An N-Source Shared
Buffer," Ser. No. 09/651,924, filed Aug. 31, 2000, and 
"Chaining Directory Reads And Writes To Reduce DRAM 
Bandwidth In A Directory Based CC-NUMA Protocol," Ser.
No. 09/652,315, filed Aug. 31, 2000, all of which are 
incorporated by reference herein. 

STATEMENT REGARDING FEDERALLY 
SPONSORED RESEARCH OR DEVELOPMENT 

Not applicable. 

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention generally relates to a computer 
system that includes a plurality of microprocessors. More 
particularly, the invention relates to a multiple processor
computer system with distributed memory sub-systems
accessible by the processors in the system. Still more
particularly, the present invention relates to an improved
system and method that supports multiple address interleaving
techniques that can be active simultaneously to reduce
latency and increase memory bandwidth.
2. Background of the Invention 

One of the basic issues in any computer system is determining
the most efficient technique to address the various
memory devices that are present in the system. The memory
in a computer system stores data and instructions for subsequent
retrieval and use by the processor and other components
in the computer system. To facilitate the storage,
retrieval and subsequent use of the data and instructions,
the processor and other computer system components must
be able to identify the address of the stored data. Typically,
the computer system implements a defined protocol for
assigning addresses to stored data. Whenever data is written
to or read from memory, the component requesting the transaction
transmits an address signal or command to the
memory identifying where the data should be written, or
conversely, from where the data should be read. The
memory typically has an associated memory controller that
includes an address decoder that decodes the bits in the
address signal to determine the location within memory
being accessed. In a conventional memory system, this
includes identifying the page of memory, and within the
page, the row and column of the data being written or read.
The particular coding in the address signal or command
typically identifies the starting address of a particular
memory device, while other bits identify the offset within
the memory device where the particular access is targeted.
When data is written into memory, typically continuous
memory addresses are used to identify contiguous memory
locations. Thus, for example, the address 8001 will be
followed by address 8002 (both of which would be written
in binary format) to identify adjacent memory locations
within a page of memory. More recently, it has become
commonplace to include banks of memory within a computer
system, so that a conventional personal computer
system may include a single processor with multiple
memory banks accessible via different memory ports. Some
or all of the memory banks may be populated with some
form of dynamic random access memory ("DRAM"). In
systems with multiple memory banks, it has become common
to implement some form of interleaving to more
efficiently distribute the data within the memory banks.
Thus, for example, each continuous address of memory may
be distributed among different memory banks, instead of
within a single memory bank. The advantage of such an
interleaving scheme is that it may increase memory
bandwidth, because it permits the higher speed processor to
conduct overlapping memory transactions to the slower
speed memory banks via the different memory ports.
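
The effect of low-order interleaving can be sketched in a few lines of Python. The sketch assumes eight banks selected by the three low-order address bits, as in the example discussed in this section; the names are hypothetical, and banks are numbered from 0 here, whereas the text counts them from 1.

```python
# Illustrative sketch only: bank count and names are assumptions.
NUM_BANKS = 8  # must be a power of two for the bit mask below to work

def bank_and_offset(addr: int) -> tuple:
    """Split an address into (bank, offset within bank) using low-order bits."""
    bank = addr & (NUM_BANKS - 1)                     # low three bits pick the bank
    offset = addr >> (NUM_BANKS.bit_length() - 1)     # remaining bits are the offset
    return bank, offset
```

Because the bank number comes from the lowest bits, consecutive addresses such as 8000 and 8001 fall in different banks without any software involvement.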

To implement an interleaving scheme in a single processor
system, certain bits in the address command are selected
to identify the memory bank being accessed. Thus, for
example, if eight memory banks are available in the system,
three of the address bits might be used to identify a specific
memory bank. If these three address bits are the low order
bits in the address command, then consecutive memory
addresses are distributed across the memory banks automatically
by the system hardware. In such a system, the address
8000 might correspond to an address location in memory
bank 1, while address 8001 might correspond to an address
location in memory bank 2. Thus, by using the low order
address bits to define the memory bank, the system will
interleave data among memory banks as the operating
system increments through the address space.

If conversely, the three address bits identifying the
memory bank are high order bits (above the bits denoting
the virtual page size) then address interleaving typically is
performed as part of the software translation from the virtual
address to the physical address. Thus, in this type of system,
the interleaving is determined by software page placement
policy choices typically programmed into the operating
system.

In a distributed memory, multi-processor computer
system, the memory is distributed throughout the computer
system, and is not located in one finite location. In particular,
one technique for implementing such a system is to associate
memory with each processor in the computer system. Each
of the processors within the system may be capable of
accessing the memory associated with any other processor
by properly transmitting a command coupled with the
desired memory address to the appropriate memory location.
Identifying an address within any particular memory location
requires selecting the processor associated with the
memory.

Because memory is distributed throughout the computer
system, and multiple processors exist that may each simultaneously
seek to access the same memory device or even
the same memory data, special steps must be implemented
to ensure coherency of the data, while still maximizing the
speed of memory accesses to minimize system latency. In an
attempt to reduce latency (or "waiting") caused by coincident
accesses to the same memory location, memory may be
distributed within a particular processor sub-system by
including multiple memory ports supporting separate
memory banks. This adds yet another level of detail that
must be identified in the address coding scheme. Thus, in
addition to the processor identification, the address command
must also identify the memory bank and the memory
offset for that particular memory bank.

The conventional technique for addressing memory in a
distributed memory computer system is to have the operating
system assign continuous address references to contiguous
locations on the same processor. Thus, typically the high
order bits in the address define the processor, and the lower
order bits define the offset in the memory associated with
that processor. Thus, as the operating system increments
through the address space, the processor being accessed does
not change, as the lower order address bits are incremented.
Thus, incrementing address space means that the data transactions
occur locally on a given processor. Such a situation
may be advantageous if the local processor is the source of
the data transactions because it reduces the latency of the
memory transactions by avoiding the necessity of transmitting
commands to another processor to obtain the requested
data. In other instances, however, this addressing scheme
may be unfavorable. If, for example, multiple processors are
referencing the same contiguous piece of memory associated
with a different processor, a bottleneck may occur as each
requesting processor tries to simultaneously communicate
with the processor that controls the targeted memory.

Because the processor identification occurs in the high
order bits of the address signal, typically the interleaving of
data among processors is performed through software. Thus,
in high order interleaving systems that are used with multiple
processing systems, the task of distributing addresses is
made at a page granularity level by the system software
when it determines the virtual-to-physical page translation.
Such software implementations, however, require involvement
of the processor, and thus may act as a drag on system
performance. In addition, simultaneous software interleaving
can be very expensive since it requires many operations
to convert addresses to a canonical form necessary for the
[...]. Software interleaving also can be difficult to
[...] cycles for each
memory transaction performed. It would be advantageous to
develop a hardware address scheme that permits simultaneous
interleaving without the attendant problems caused by
software interleaving.

BRIEF SUMMARY OF THE INVENTION

The problems noted above are solved in large part by the
system and techniques of the present invention, which
permit multiple different [...] interleavings to be active
simultaneously. In particular, unstriped addresses are used to
[...] processors using high order address bits,
[...] instructions to be copied locally to all processors
[...] all instructions are transmitted
[...]. Striped addresses, conversely, interleave
[...]. This makes a group of four
[...] with data references distributed
[...] memory ports of the four processor set. The
striping of addresses within a four processor set reduces
bottlenecks that may occur when other processors request
data associated with memory of a different processor. The
simultaneous use of striped and unstriped addresses can
improve system performance, without the attendant deficiencies
of software implemented systems.

The interleave scheme implemented in the preferred
embodiment of the present invention uses an address bit to
distinguish between two different types of address
interleaving: striped and unstriped. Preferably each processor
includes two memory ports, with an entire cache block
assigned to a single memory port. In both striped and
unstriped interleavings, the lowest order address bits (0-5)
indicate the cache alignment, and address bit 6 indicates the
port within a processor. The unstriped interleave identifies
the cache block within a port in address bits 7-33, and the
lower processor bits in address bits 34 and 35. The
striped interleave has the lower two processor bits in address
bits 7 and 8, and the cache block in address bits 37-43 (for
a system with up to 256 processors, each of which can have
15 GB of memory distributed across 2 ports).

In accordance with the preferred embodiment, the present
invention is implemented in hardware. In response to a
memory access that results in a cache miss, the hardware
converts the address into a single canonical form which has
the port, offset, and processor fields in fixed positions. These
address bits are then transferred along with bit 36, which
comprises the stripe bit, to the port. The port returns the
[...] response. If necessary, the port may
re-convert the address into its original form using the stripe
bit if it needs to extract the block from another processor's
[...] conversion to the canonical form, the hardware
manages the interleaving uniformly for each case by forwarding
the reference to the appropriate memory port.
According to the preferred embodiment, the striped interleave
is used for data that is more likely to be accessed by
other processors, while the unstriped interleave is used for data
that is likely to be accessed only by the local processor.

BRIEF DESCRIPTION OF THE DRAWINGS

For a detailed description of the preferred embodiments of
the invention, reference will now be made to the accompanying
drawings in which:
FIG. 1 shows a system level diagram of a multiple
processor system coupled together in accordance with the
preferred embodiment of the present invention;

FIGS. 2a and 2b show a block diagram of one of the
processors depicted in the preferred embodiment of FIG. 1;

FIG. 3 shows sixteen processors, with associated memory
ports, grouped into four local striped sets in accordance with
an exemplary embodiment of the present invention; and

FIGS. 4a and 4b illustrate exemplary address command
signals for an unstriped address interleave and for a striped
address interleave, respectively.

NOTATION AND NOMENCLATURE

Certain terms are used throughout the following description
and claims to refer to particular system components. As
one skilled in the art will appreciate, computer companies
may refer to a component by different names. This document
does not intend to distinguish between components that
differ in name but not function. In the following discussion
and in the claims, the terms "including" and "comprising"
are used in an open-ended fashion, and thus should be
interpreted to mean "including, but not limited to . . . ". Also,
the term "couple" or "couples" is intended to mean either an
indirect or direct electrical connection. Thus, if a first device
couples to a second device, that connection may be through
a direct electrical connection, or through an indirect electrical
connection via other devices and connections. To the
extent that any term is not specially defined in this
specification, the intent is that the term is to be given its
plain and ordinary meaning.

DETAILED DESCRIPTION OF THE
PREFERRED EMBODIMENTS

Referring now to FIG. 1, in accordance with the preferred
embodiment of the invention, computer system 90 comprises
one or more processors 100 coupled to a memory
sub-system 102 and an input/output ("I/O") controller 104.
As shown in FIG. 1, computer system 90 includes multiple
processors 100 (twelve such processors are shown for purposes
of illustration), with each processor coupled to an
associated memory sub-system 102 and an I/O controller
104. Each processor 100 preferably includes four ports for
connection to adjacent processors. The inter-processor ports
are designated "north," "south," "east," and "west" in accordance
with the well-known Manhattan grid architecture. As
such, each processor 100 can be connected to four other
processors. The processors on both ends of the system layout
preferably wrap around and connect to processors on the
opposite side to implement a 2D torus-type connection.
Although twelve processors 100 are shown in the exemplary
embodiment of FIG. 1, any desired number of processors
can be included. In the preferred embodiment, computer
system 90 is designed to accommodate either 256 processors
or 128 processors, depending on the size of the memory
associated with the processors.

The I/O controller 104 provides an interface to various
input/output devices, such as disk drives 105 and 106, as
shown in the lower left-hand side of FIG. 1. Data from the
I/O devices thus enters the 2D torus via the I/O controllers
associated with the various processors. In addition to disk
drives, other input/output devices also may be connected to
the I/O controllers, including for example, keyboards, mice,
CD-ROMs, DVD-ROMs, PCMCIA drives, and the like.

In accordance with the preferred embodiment, the
memory 102 preferably comprises RAMbus™ memory
devices, but other types of memory devices can be used if
desired. The capacity of the memory devices 102 may be of
any suitable size. The memory devices 102 preferably are
implemented as Rambus Interface Memory Modules
("RIMMs").

In general, computer system 90 can be configured so that
any processor 100 can access its own memory 102 and I/O
devices as well as the memory and I/O devices of all other
processors in the system. Preferably the computer system
has physical connections between each processor
resulting in low interprocessor communication times and
improved memory and I/O device access reliability. If
physical connections are not present between each pair of
processors, a pass-through or bypass path preferably is
available for each processor to access the memory and I/O
devices of any other processor through one or more intermediary
processors, as graphically depicted in FIG. 1.

The processors may be implemented with any suitable
microprocessor architecture, although the Alpha processor is
used in the preferred embodiment. Therefore, to aid in
understanding the preferred embodiment of the present
invention, details regarding the preferred processor architecture
will be described with reference to FIGS. 2a and 2b,
with an understanding that this architecture is not a mandatory
requirement to practice the present invention. After
discussing the preferred processor architecture with reference
to FIGS. 2a and 2b, the present invention will be
addressed in further detail with reference to FIGS. 3, 4a and
4b.

Referring now to FIGS. 2a and 2b, each processor 100
preferably includes an instruction cache 110, an instruction
fetch, issue and retire unit ("Ibox") 120, an integer execution
unit ("Ebox") 130, a floating-point execution unit ("Fbox")
140, a memory reference unit ("Mbox") 150, a data cache
160, an L2 instruction and data cache control unit ("Cbox")
170, a level L2 cache 180, two memory controllers
("Zbox0" and "Zbox1") 190, and an interprocessor and I/O
router unit ("Rbox") 200. The following discussion
describes each of these units in more detail.

Each of the various functional units 110-200 contains
control logic that communicates with the control logic of
other functional units, as shown in FIGS. 2a and 2b. Thus,
referring still to FIGS. 2a and 2b, the instruction cache
control logic 110 communicates with the Ibox 120, Cbox
170, and L2 Cache 180. In addition to the control logic
communicating with the instruction cache 110, the Ibox
control logic 120 communicates with Ebox 130, Fbox 140
and Cbox 170. The Ebox 130 and Fbox 140 control logic
both communicate with the Mbox 150, which in turn communicates
with the data cache 160 and Cbox 170. The Cbox
control logic also communicates with the L2 cache 180,
Zboxes 190, and Rbox 200.

Referring still to FIGS. 2a and 2b, the Ibox 120 preferably
includes a fetch unit 121 which contains a virtual program
counter ("VPC") 122, a branch predictor 123, an instruction-stream
translation buffer 124, an instruction predecoder 125,
a retire unit 126, decode and rename registers 127, an integer
instruction queue 128, and a floating point instruction queue
129. Generally, the VPC 122 maintains virtual addresses for
instructions that are in-flight. An instruction is said to be
"in-flight" from the time it is fetched until it retires or aborts.
The Ibox 120 can accommodate as many as 80 instructions,
in 20 successive fetch slots, in-flight between the decode and
rename registers 127 and the end of the pipeline. The VPC
122 preferably includes a 20-entry queue to store the fetched
VPC addresses.
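
The wrap-around connection of the north, south, east, and west ports described for FIG. 1 can be sketched as follows. The grid shape (4 columns by 3 rows for the twelve processors shown), the row-major processor numbering, and the function name are assumptions made for illustration; the text does not specify them.

```python
# Illustrative sketch only: grid shape, numbering, and names are assumptions.

def torus_neighbors(pid: int, cols: int = 4, rows: int = 3) -> dict:
    """Return the processor reached through each port of processor `pid`."""
    row, col = divmod(pid, cols)
    return {
        # Modulo arithmetic makes the edges wrap to the opposite side.
        "north": ((row - 1) % rows) * cols + col,
        "south": ((row + 1) % rows) * cols + col,
        "west": row * cols + (col - 1) % cols,
        "east": row * cols + (col + 1) % cols,
    }
```

With this numbering, processor 0 in the top-left corner reaches processor 8 through its north port and processor 3 through its west port, which is the wrap-around behavior of a 2D torus.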






us 6,567, 


The branch predictor 123 is used by the Ibox 120 for
predicting the outcome of branch instructions. A branch
instruction requires program execution either to continue
with the instruction immediately following the branch
instruction if a certain condition is met, or branch to a
different instruction if the particular condition is not met.
Accordingly, the outcome of a branch instruction is not
known until the instruction is executed. In a pipelined
architecture, a branch instruction (or any instruction for that
matter) may not be executed for at least several, and perhaps
many, clock cycles after the fetch unit in the processor
fetches the branch instruction. In order to keep the pipeline
full, which is desirable for efficient operation, the processor
preferably includes branch prediction logic that predicts the
outcome of a branch instruction before it is actually
executed (also referred to as "speculating"). The branch
predictor 123, which receives addresses from the VPC queue
122, preferably bases its speculation on short and long-term
history of prior instruction branches. As such, using branch
prediction logic, the fetch unit can speculate the outcome of
a branch instruction before it is actually executed. The
speculation, however, may or may not turn out to be
accurate. Branch predictor 123 uses any suitable branch
prediction algorithm that results in correct speculations more
often than misspeculations, enhancing the overall performance
of the processor.
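
The text leaves the prediction algorithm open. As one well-known example of history-based prediction, and not a description of branch predictor 123 itself, a two-bit saturating counter scheme can be sketched as follows. The class name, table size, and indexing by instruction address are all assumptions for illustration.

```python
# Illustrative sketch of a generic two-bit saturating counter predictor.
# This is a textbook technique swapped in as an example; the document does
# not say which algorithm branch predictor 123 uses.

class TwoBitPredictor:
    def __init__(self, entries: int = 1024):
        self.entries = entries
        # Counter values 0-1 predict "not taken"; 2-3 predict "taken".
        self.counters = [1] * entries

    def _index(self, pc: int) -> int:
        return (pc >> 2) % self.entries  # assumes 4-byte instructions

    def predict(self, pc: int) -> bool:
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        """Move the counter one step toward the actual outcome."""
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
```

Because the counter must be wrong twice in a row before the prediction flips, a single atypical outcome does not disturb the prediction for a branch that usually goes one way.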

The instruction translation buffer ("ITB") 124 couples to
the instruction cache 110 and the fetch unit 121. The ITB
124 comprises a 128-entry, fully-associative instruction-stream
translation buffer that is used to store recently used
instruction-stream address translations and page protection
information. Preferably, each of the entries in the ITB 124
may map 1, 8, 64 or 512 contiguous 8-kilobyte ("KB") pages
or 1, 32, 512, or 8192 contiguous 64-kilobyte pages. The
allocation scheme used for the ITB 124 is a round-robin
scheme, although other schemes can be used as desired.
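
As a quick worked check of what those page groupings mean, the following sketch computes the address range covered by a single entry for each option. The helper name is hypothetical.

```python
# Illustrative arithmetic only: the helper name is hypothetical.
KB = 1024

def entry_coverage(page_kb: int, page_counts: tuple) -> dict:
    """Bytes mapped by one translation entry for each contiguous page count."""
    return {n: n * page_kb * KB for n in page_counts}

small_pages = entry_coverage(8, (1, 8, 64, 512))       # 8-KB pages
large_pages = entry_coverage(64, (1, 32, 512, 8192))   # 64-KB pages
```

The largest groupings therefore cover 4 MB with 8-KB pages and 512 MB with 64-KB pages per entry.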

The predecode logic 125 reads an octaword (16 contiguous
bytes) from the instruction cache 110. Each octaword
read from the instruction cache 110 may contain up to four
naturally aligned instructions per cycle. Branch prediction
and line prediction bits accompany the four instructions
fetched by the predecoder 125. The branch prediction
scheme implemented in branch predictor 123 generally
works most efficiently when only one branch instruction is
contained among the four fetched instructions. The predecoder
125 predicts the instruction cache line that the branch
predictor 123 will generate. The predecoder 125 generates
fetch requests for additional instruction cache lines and
stores the instruction stream data in the instruction cache.

Referring still to FIGS. 2a and 2b, the retire unit 126
fetches instructions in program order, executes them out of
order, and then retires them (also called "committing" an
instruction) in order. The Ibox 120 logic maintains the
architectural state of the processor by retiring an instruction
only if all previous instructions have executed without
generating exceptions or branch mispredictions. An exception
is any event that causes suspension of normal instruction
execution. Retiring an instruction commits the processor
to any changes that the instruction may have made to the
software accessible registers and memory. The processor
100 preferably includes the following three machine code
accessible hardware resources: integer and floating-point registers,
memory, and internal processor registers. With respect to the
present invention, one of the internal processor registers for
the Cbox 170 is the Cbox stripe control register (with
machine code mnemonic CBOX_STP_CTL).

The decode and rename registers 127 contain logic that 
forwards instructions to the integer and floating-point 
instruction queues 128, 129. The decode and rename registers
127 preferably eliminate register write-after-read
("WAR") and write-after-write ("WAW") data dependencies
while preserving true read-after-write ("RAW") data dependencies.
This permits instructions to be dynamically
rescheduled. In addition, the decode and rename registers 
127 permit the processor to speculatively execute instruc- 
tions before the control flow preceding those instructions is 
resolved. 

The logic in the decode and rename registers 127 prefer- 
ably translates each instruction's operand register specifiers 
from the virtual register numbers in the instruction to the 
physical register numbers that hold the corresponding 
architecturally-correct values. The logic also renames each 
instruction destination register specifier from the virtual 
number in the instruction to a physical register number 
chosen from a list of free physical registers, and updates the
register maps. The decode and rename register logic 127 can 
process four instructions per cycle. Preferably, the logic in 
the decode and rename registers 127 does not return the 
physical register, which holds the old value of an instruc- 
tion's virtual destination register, to the free list until the 
instruction has been retired, indicating that the control flow 
up to that instruction has been resolved. 
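
The renaming steps in this paragraph can be sketched as a small Python model. It assumes 32 virtual registers and an 80-entry physical register file with an identity mapping at reset; the class and method names are hypothetical, and a real implementation would stall rather than raise an error when the free list is empty.

```python
# Illustrative sketch only: register counts, reset mapping, and names are
# assumptions, not details taken from the text.

class RenameMap:
    def __init__(self, num_virtual: int = 32, num_physical: int = 80):
        self.map = list(range(num_virtual))                # virtual -> physical
        self.free = list(range(num_virtual, num_physical)) # free physical registers

    def rename(self, dest: int, srcs: list) -> tuple:
        """Rename one instruction; returns (physical sources, new dest, old dest)."""
        phys_srcs = [self.map[s] for s in srcs]  # read sources before remapping dest
        new_phys = self.free.pop(0)              # IndexError here models "no free register"
        old_phys = self.map[dest]
        self.map[dest] = new_phys
        return phys_srcs, new_phys, old_phys

    def retire(self, old_phys: int) -> None:
        """Return the previous destination register to the free list at retirement."""
        self.free.append(old_phys)
```

Giving every write a fresh physical register is what removes WAR and WAW hazards, while reading the sources through the current map preserves the true RAW dependencies. Holding the old physical register until retirement keeps the prior value available if the instruction is squashed.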

If a branch misprediction or exception occurs, the register 
logic backs up the contents of the integer and floating-point 
rename registers to the state associated with the instruction 
that triggered the condition, and the fetch unit 121 restarts at 
the appropriate Virtual Program Counter ("VPC"). 
Preferably, as noted above, 20 valid fetch slots containing up 
to 80 instructions can be in flight between the registers 127 
and the end of the processor's pipeline, where control flow 
is finally resolved. The register 127 logic is capable of 
backing up the contents of the registers to the state associ- 
ated with any of these 80 instructions in a single cycle. The 
register logic 127 preferably places instructions into the 
integer or floating-point issue queues 128, 129, from which 
they are later issued to functional units 130 or 136 for 
execution. 

The integer instruction queue 128 preferably includes 
capacity for 20 integer instructions. The integer instruction 
queue 128 issues instructions at a maximum rate of four 
instructions per cycle. The specific types of instructions 
processed through queue 128 include: integer operate 
commands, integer conditional branches, unconditional 
branches (both displacement and memory formats), integer 
and floating-point load and store commands, Privileged 
Architecture Library ("PAL") reserved instructions, and
integer-to-floating-point and floating-point-to-integer
conversion commands.

Referring still to FIGS. 2a and 2b, the integer execution 
unit (Ebox) 130 includes arithmetic logic units ("ALUs") 
131, 132, 133, and 134 and two integer register files 135. 
Ebox 130 preferably comprises a 4-path integer execution 
unit that is implemented as two functional-unit "clusters"
labeled 0 and 1. Each cluster contains a copy of an 80-entry
physical-register file and two subclusters, named upper
("U") and lower ("L"). As such, the subclusters 131-134 are
labeled U0, L0, U1, and L1. Bus 137 provides cross-cluster
communication for moving integer result values between the 
clusters. 

The subclusters 131-134 include various components that 
are not specifically shown in FIG. 2a. For example, the 
subclusters preferably include four 64-bit adders that are 
used to calculate results for integer add instructions, logic 
units, barrel shifters and associated byte logic, conditional 
branch logic, a pipelined multiplier for integer multiply 
operations, and other components known to those of 
ordinary skill in the art. 

Each entry in the integer instruction queue 128 preferably 
asserts four request signals, one for each of the Ebox 130 
subclusters 131, 132, 133, and 134. A queue entry asserts a 
request when it contains an instruction that can be executed 
by the subcluster if the instruction's operand register values 
are available within the subcluster. The integer instruction 
queue 128 includes two arbiters, one for the upper 
subclusters 132 and 133 and another arbiter for the lower 
subclusters 131 and 134. Each arbiter selects two of the 
possible 20 requesters for service each cycle. Preferably, the 
integer instruction queue 128 arbiters choose between 
simultaneous requesters of a subcluster based on the age of the 
request: older requests are given priority over newer 
requests. If a given instruction requests both lower 
subclusters, and no older instruction requests a lower 
subcluster, then the arbiter preferably assigns subcluster 131 
to the instruction. If a given instruction requests both upper 
subclusters, and no older instruction requests an upper 
subcluster, then the arbiter preferably assigns subcluster 133 
to the instruction. 

The floating-point instruction queue 129 preferably 
comprises a 15-entry queue and issues the following types of 
instructions: floating-point operates, floating-point 
conditional branches, floating-point stores, and floating-point 
register to integer register transfers. Each queue entry 
preferably includes three request lines: one for the add pipeline, 
one for the multiply pipeline, and one for the two store 
pipelines. The floating-point instruction queue 129 includes 
three arbiters, one for each of the add, multiply, and store 
pipelines. The add and multiply arbiters select one requester 
per cycle, while the store pipeline arbiter selects two 
requesters per cycle, one for each store pipeline. As with the 
integer instruction queue 128 arbiters, the floating-point 
instruction queue arbiters select between simultaneous 
requesters of a pipeline based on the age of the request: 
older requests are given priority. Preferably, floating-point 
store instructions and floating-point register to integer 
register transfer instructions in even numbered queue entries 
arbitrate for one store port. Floating-point store instructions 
and floating-point register to integer register transfer 
instructions in odd numbered queue entries arbitrate for the second 
store port. 

Floating-point store instructions and floating-point 
register to integer register transfer instructions are queued in both 
the integer and floating-point queues. These instructions 
wait in the floating-point queue until their operand register 
values are available from the floating-point execution unit 
("Fbox") registers. The instructions subsequently request 
service from the store arbiter. Upon being issued from the 
floating-point queue 129, the instructions signal the 
corresponding entry in the integer queue 128 to request service. 
Finally, upon being issued from the integer queue 128, the 
operation is completed. 

The integer registers 135, 136 preferably contain storage 
for the processor's integer registers, results written by 
instructions that have not yet been retired, and other 
information as desired. The two register files 135, 136 preferably 
contain identical values. Each register file preferably 
includes four read ports and six write ports. The four read 
ports are used to source operands to each of the two 
subclusters within a cluster. The six write ports are used to 
write results generated within the cluster or another cluster 
and to write results from load instructions. 

The floating-point execution queue ("Fbox") 129 contains 
a floating-point add, divide and square-root calculation unit 
142, a floating-point multiply unit 144 and a register file 146. 
Floating-point add, divide and square root operations are 
handled by the floating-point add, divide and square root 
calculation unit 142 while floating-point operations are 
handled by the multiply unit 144. 

The register file 146 preferably provides storage for 72 
entries, including 31 floating-point registers and 41 values 
written by instructions that have not yet been retired. The 
Fbox register file 146 contains six read ports and four write 
ports (not specifically shown). Four read ports are used to 
source operands to the add and multiply pipelines, and two 
read ports are used to source data for store instructions. Two 
write ports are used to write results generated by the add and 
multiply pipelines, and two write ports are used to write 
results from floating-point load instructions. 

Referring still to FIG. 2a, the Mbox 150 controls the L1 
data cache 160 and ensures architecturally correct behavior 
for load and store instructions. The Mbox 150 preferably 
contains a datastream translation buffer ("DTB") 151, a load 
queue ("LQ") 152, a store queue ("SQ") 153, and a miss 
address file ("MAF") 154. The DTB 151 preferably 
comprises a fully associative translation buffer that is used to 
store data stream address translations and page protection 
information. Each of the entries in the DTB 151 can map 1, 
8, 64, or 512 contiguous 8-KB pages. The allocation scheme 
preferably is round robin, although other suitable schemes 
could also be used. The DTB 151 also supports an Address 
Space Number ("ASN") and contains an Address 
Space Match ("ASM") bit. The ASN is an optionally 
implemented register used to reduce the need for invalidation of 
cached address translations for process-specific addresses 
when a context switch occurs. 

The LQ 152 preferably comprises a reorder buffer used 
for load instructions. It contains 32 entries and maintains the 
state associated with load instructions that have been issued 
to the Mbox 150, but for which results have not been 
delivered to the processor and the instructions retired. The 
Mbox 150 assigns load instructions to LQ slots based on the 
order in which they were fetched from the instruction cache 
110, and then places them into the LQ 152 after they are 
issued by the integer instruction queue 128. The LQ 152 also 
helps to ensure correct memory reference behavior for the 
processor. 

The SQ 153 preferably is a reorder buffer and graduation 
unit for store instructions. It contains 32 entries and 
maintains the state associated with store instructions that have 
been issued to the Mbox 150, but for which data has not been 
written to the data cache 160. The Mbox 150 assigns store 
instructions to SQ slots based on the order in which they 
were fetched from the instruction cache 110 and places them 
into the SQ 153 after they are issued by the instruction cache 
110. The SQ 153 holds data associated with the store 
instructions issued from the integer instruction unit 128 until 
they are retired, at which point the store can be allowed to 
update the data cache 160. The LQ 152 also helps to ensure 
correct memory reference behavior for the processor. 

The MAF 154 preferably comprises a 16-entry file that 
holds physical addresses associated with pending instruction 
cache 110 and data cache 160 fill requests and pending 
input/output ("I/O") space read transactions. 

Processor 100 preferably includes two on-chip 
primary-level ("L1") instruction and data caches 110 and 160, and a 
single secondary-level, unified instruction/data ("L2") cache 
180 (FIG. 2b). The L1 instruction cache 110 preferably 
comprises a 64-KB virtual-addressed, two-way 
set-associative cache. Prediction logic improves the 
performance of the two-way set-associative cache without slowing 
the cache access time. Each instruction cache block 
preferably contains a plurality (preferably 16) of instructions, virtual 
tag bits, an address space number, an address space match 
bit, a one-bit PALcode bit to indicate physical addressing, a 
valid bit, data and tag parity bits, four access-check bits, and 
predecoded information to assist with instruction processing 
and fetch control. 

The L1 data cache 160 preferably comprises a 64 KB, 
two-way set associative, virtually indexed, physically 
tagged, write-back, read/write allocate cache with 64-byte 
cache blocks. During each cycle the data cache 160 prefer- 
ably performs one of the following transactions: two quad- 
word (or shorter) read transactions to arbitrary addresses, 
two quadword write transactions to the same aligned 
octaword, two non-overlapping less-than-quadword writes 
to the same aligned quadword, or one sequential read and write 
transaction from and to the same aligned octaword. 
Preferably, each data cache block contains 64 data bytes and 
associated quadword ECC bits, physical tag bits, valid, dirty, 
shared, and modified bits, tag parity bit calculated across the 
tag, dirty, shared, and modified bits, and one bit to control 
round-robin set allocation. The data cache 160 is organized 
to contain two sets, each with 512 rows containing 64-byte 
blocks per row (i.e., 32-KB of data per set). The processor 
100 uses two additional bits of virtual address beyond the 
bits that specify an 8-KB page in order to specify the data 
cache row index. A given virtual address might be found in 
four unique locations in the data cache 160, depending on 
the virtual-to-physical translation for those two bits. The 
processor 100 prevents this aliasing by keeping only one of 
the four possible translated addresses in the cache at any 
time. 
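
The aliasing count follows directly from the cache geometry given above. The short sketch below works through that arithmetic; the constant and function names are illustrative only and do not appear in the disclosure.

```python
# Geometry stated in the text: 2 sets, 512 rows per set,
# 64-byte blocks, and 8-KB pages. Names are illustrative.
BLOCK_BYTES = 64
ROWS_PER_SET = 512
SETS = 2
PAGE_BYTES = 8 * 1024


def log2_exact(value: int) -> int:
    """Return log2(value); value must be a power of two."""
    assert value > 0 and value & (value - 1) == 0
    return value.bit_length() - 1


offset_bits = log2_exact(BLOCK_BYTES)    # byte offset within a block
row_bits = log2_exact(ROWS_PER_SET)      # row index within a set
page_bits = log2_exact(PAGE_BYTES)       # untranslated page-offset bits

bytes_per_set = ROWS_PER_SET * BLOCK_BYTES
total_bytes = SETS * bytes_per_set

# Index bits that lie above the page offset are virtual address
# bits whose value depends on the virtual-to-physical translation.
extra_index_bits = offset_bits + row_bits - page_bits
alias_locations = 2 ** extra_index_bits

print(bytes_per_set, total_bytes, extra_index_bits, alias_locations)
```

The row index plus block offset needs 15 address bits, two more than the 13 bits of an 8-KB page offset, which is why a given virtual address could otherwise land in four rows.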

The L2 cache 180 preferably comprises a 1.75-MB, 
seven-way set associative write-back mixed instruction and 
data cache. Preferably, the L2 cache holds physical address 
data and coherence state bits for each block. 

Referring now to FIG. 2b, the L2 instruction and data 
cache control unit ("Cbox") 170 controls the L2 instruction 
and data cache 190 and system ports. As shown, the Cbox 
170 contains a fill buffer 171, a data cache victim buffer 172, 
a system victim buffer 173, a cache miss address file 
("CMAF") 174, a system victim address file ("SVAF") 175, 
a data victim address file ("DVAF") 176, a probe queue 
("PRBQ") 177, a requester miss-address file ("RMAF") 178, 
a store to I/O space ("STIO") 179, and an arbitration unit 
181. In addition, the Cbox 170 also preferably includes a 
stripe control register 183 that functions as a mask for 
memory blocks in the associated memory, indicating 
whether each memory block may be accessed with striped 
addressing techniques, as disclosed herein. 

The fill buffer 171 preferably buffers data received from 
other functional units external to the Cbox. The data and 
instructions are written into the fill buffer 171, and other 
logic units in the Cbox process the data and instructions 
before relaying to other functional units or the L1 cache. 
The data cache victim buffer ("VDF") 172 preferably stores 
data flushed from the L1 cache or sent to the System Victim 
Data Buffer 173. The System Victim Data Buffer ("SVDB") 
173 is used to send data flushed from the L2 cache to other 
processors in the system and to memory. Cbox 
Miss-Address File ("CMAF") 174 preferably holds addresses of 
any transaction that results in an L1 cache miss. CMAF 
updates and maintains the status of these addresses. The 
System Victim-Address File ("SVAF") 175 in the Cbox 
preferably contains the addresses of all SVDB data entries. 
The Data Victim-Address File ("DVAF") 176 preferably 
contains the addresses of all data cache victim buffer 
("VDF") data entries. 

The Probe Queue ("PRBQ") 177 preferably comprises an 
18-entry queue that holds pending system port cache probe 
commands and addresses. This queue includes 10 remote 
request entries, 8 forward entries, and looks up L2 tags and 
requests from the PRBQ content addressable memory 
("CAM") against the RMAF, CMAF and SVAF. Requestor 
Miss-Address Files ("RMAF") 178 in the Cbox preferably 
accepts requests and responds with data or instructions from 
the L2 cache. Data accesses from other functional units in 
the processor, other processors in the computer system or 
any other devices that might need data out of the L2 cache 
are sent to the RMAF for service. The Store Input/Output 
("STIO") 179 preferably transfers data from the local 
processor to I/O cards in the computer system. Finally, 
arbitration unit 181 in the Cbox preferably arbitrates between load 
and store accesses to the same memory location of the L2 
cache and informs other logic blocks in the Cbox and other 
computer system functional units of any conflict. 

The stripe control register 183 preferably comprises a 64 
bit register that serves as a mask representing memory 
blocks in the memory sub-system associated with each 
processor. Each bit in the stripe control register 183 
represents either 256 MB or 512 MB of memory. Thus, the full 
64 bits represent the maximum 16 GB or 32 GB of memory 
associated with a particular processor. If the corresponding 
mask bit is clear, the memory block must be addressed 
without striping, using processor contiguous addressing. 
Conversely, if the corresponding mask bit is set, the memory 
block must be referenced with address striping. The 
determination of whether a memory block must be addressed by 
striping may be made in many ways, as will be apparent to 
one skilled in the art. In the preferred embodiment, this 
determination is based on whether the data stored in the 
memory block will be accessed solely by the local processor, 
or whether other processors also may access the data block. 
This determination may be historically based, depending on 
the prior access history of similar type data. Other predictive 
logic also may be used, if desired, to set or clear the 
corresponding mask bit for a memory block. 
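
As a rough illustration of how such a mask could be consulted, the sketch below maps a memory offset to its mask bit. It assumes, hypothetically, that mask bit i covers the i-th block in ascending offset order; the disclosure does not state the bit ordering, and the function names are invented for this example.

```python
MASK_BITS = 64  # width of the stripe control register per the text


def block_bytes(memory_per_processor_gb: int) -> int:
    """Bytes covered by one mask bit: the memory divided over 64 bits."""
    return (memory_per_processor_gb << 30) // MASK_BITS


def is_striped(stripe_mask: int, offset: int,
               memory_per_processor_gb: int = 16) -> bool:
    """Return True if the block holding `offset` has its mask bit set.

    Assumption: bit i of the mask covers block i (ascending offsets).
    """
    index = offset // block_bytes(memory_per_processor_gb)
    if not 0 <= index < MASK_BITS:
        raise ValueError("offset lies outside the processor's memory")
    return bool((stripe_mask >> index) & 1)


# Example: only the second 256-MB block is marked as striped.
mask = 0b10
print(is_striped(mask, 0))          # first block: contiguous
print(is_striped(mask, 256 << 20))  # second block: striped
```

With 16 GB per processor each bit covers 256 MB, and with 32 GB each bit covers 512 MB, matching the two granularities named in the text.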

The stripe control register may preferably be used to 
enable the associated computer to issue references to a 
memory location even though the location does not exist. 
Such a memory reference may be referred to as a 
non-existent memory reference ("NXM"). In the preferred 
embodiment, a correctly functioning processor may generate 
NXMs during normal operation because it can create 
speculative memory references. If a processor was not permitted 
to generate speculative memory references, then the 
software could be used to ensure that only correct addresses were 
used. If one were to consider a memory location A in 
DRAM memory, this location could be referenced when 
a processor used either the address A' (unstriped) or A" 
(striped). Only one of A' and A" is a valid address. The 
stripe control register 183 guarantees that only one of A' and 
A" exists at the same time. If A' is the legitimate address, then 
any reference to A using the A" address will be an NXM 
reference: the reference is nulled and the address A" is not 
allowed to be loaded into the cache. 
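
The rule described above reduces to a comparison between the addressing form of a reference and the mask bit of the block it targets. The following minimal sketch states that rule; the function name and the string results are hypothetical and serve only to make the logic concrete.

```python
def classify_reference(address_is_striped: bool, mask_bit_set: bool) -> str:
    """Classify a memory reference against the stripe control mask.

    A reference is valid only when its addressing form (striped or
    processor contiguous) matches the mask bit of the target block;
    otherwise it is a non-existent memory (NXM) reference, which is
    nulled and never loaded into the cache.
    """
    return "valid" if address_is_striped == mask_bit_set else "NXM"


# Block marked contiguous (mask bit clear): A' is valid, A" is NXM.
print(classify_reference(address_is_striped=False, mask_bit_set=False))
print(classify_reference(address_is_striped=True, mask_bit_set=False))
```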

Referring still to FIG. 2b, processor 100 preferably 
includes dual, integrated RAMbus memory controllers 190 
(identified as Zbox0 and Zbox1). Thus, each processor 
preferably includes two memory ports (referred to herein as 
port 0 and port 1). Each Zbox controller 190 controls 4 or 5 
channels of information flow with the main memory 102 
(FIG. 1). Each Zbox preferably includes a front-end 
directory in-flight table ("DIFT") 191, a middle mapper 192, and 
a back end 193. The front-end DIFT 191 performs a number 
of functions such as managing the processor's 
directory-based memory coherency protocol, processing request 
commands from the Cbox 170 and Rbox 200, sending forward 
commands to the Rbox, sending response commands to and 
receiving packets from the Cbox and Rbox, and tracking up 
to 32 in-flight transactions. The front-end DIFT 191 also 
sends directory read and write requests to the Zbox and 
conditionally updates directory information based on 
request type, Local Probe Response ("LPR") status and 
directory state. 

The middle mapper 192 maps the physical address into 
RAMbus device format by device, bank, row, and column. 
The middle mapper 192 also maintains an open-page table 
to track all open pages and to close pages on demand if bank 
conflicts arise. The mapper 192 also schedules RAMbus 
transactions such as timer-base request queues. The Zbox 
back end 193 preferably packetizes the address, control, and 
data into RAMbus format and provides the electrical 
interface to the RAMbus devices themselves. 

The Rbox 200 provides the interfaces to as many as four 
other processors and one I/O controller 104 (FIG. 1). The 
inter-processor interfaces are designated as North ("N"), 
South ("S"), East ("E"), and West ("W") and provide 
two-way communication between adjacent processors. 

According to the preferred embodiment, the present 
invention includes the capability of striping data across a 
local set of processors, or of using processor contiguous 
addressing, depending on the status of the mask bits in the 
Cbox stripe control register 183. Thus, if the mask bit is set 
in stripe control register 183 for a memory block, references 
to that memory block must implement stripe addressing. 
Consequently, the present invention supports the ability to 
perform both striped addressing among a local processor set, 
and the ability to perform conventional processor contiguous 
addressing within the memory block of a particular proces- 
sor. 

In the preferred embodiment, the address command signal 
used for striped memory addressing differs from that used 
for processor contiguous addressing. Referring now to 
FIGS. 4a and 4b, the core address space preferably comprises 
44 physical bits. The most significant bit, bit 43, 
selects between I/O space and memory cacheable space. The 
44 bit address space supports up to 256 processors and 256 
I/O ASICs configured with up to 16 GB of associated 
memory. Larger memory sizes are also possible (such as 32 
GB), by proportionally reducing the number of processors 
that are supported. 

The memory address space preferably is defined as a 
processor-contiguous address or a striped address based on 
the status of stripe bit 36. Thus, a given memory block may 
be accessed using either processor-contiguous or striped 
addressing, though preferably only one addressing mechanism 
is used at any given time to access a particular memory 
block. The stripe bit 36 must be carried with the physical 
address offset so that conversion between a core address and 
a network address can be performed, as required. 

Referring now particularly to FIG. 4a (and assuming 256 
processors with 16 GB of memory storage capability per 
processor), bit 43 is set to zero to indicate a 
memory-cacheable address. The particular processor being accessed 
is identified by the processor identification ("PID") bits 7:0. 
The eight processor identification bits identify 
which of the 256 different processors is selected 
for a given memory access. Bits 42-37 of the address space 
identify the upper six bits (PID bits 7:2) of the processor 
identification number where the memory access is targeted. 
For processor contiguous addressing, the lower two bits of 
the processor identification number (PID bits 1:0) are 
located in bits 35-34 of the address space. Positioned 
between the lower and upper PID bits is the stripe bit, which 
comprises bit 36 of the address space. In the case of 
processor-contiguous memory space addressing, bit 36 is set 
to "0", as shown in FIG. 4a. Bits 33-0, which support up to 
16 GB of memory per processor, determine the memory 
offset at the targeted processor. If larger memory sizes are 
desired on a per processor basis, then bit 42 may be used as 
another memory offset bit, instead of as a processor ID bit, 
thus limiting the number of processors supported to 128. As 
shown in FIG. 4a, bits 5-0 specify the offset within the 
cache block, and bit 6 preferably represents the port bit 
identifying the memory port being addressed. 
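
The processor-contiguous layout of FIG. 4a can be expressed as a pair of pack and unpack helpers. This is a sketch for the 256-processor, 16-GB configuration only; the function names and the dictionary keys are invented for illustration.

```python
STRIPE_BIT = 36
OFFSET_MASK = (1 << 34) - 1  # bits 33-0: 16 GB of offset


def pack_contiguous(pid: int, offset: int) -> int:
    """Build a processor-contiguous memory address (bits 43 and 36 = 0)."""
    if not 0 <= pid < 256:
        raise ValueError("PID must fit in eight bits")
    if not 0 <= offset <= OFFSET_MASK:
        raise ValueError("offset must fit in bits 33-0")
    upper = (pid >> 2) << 37   # PID 7:2 in address bits 42-37
    lower = (pid & 0b11) << 34  # PID 1:0 in address bits 35-34
    return upper | lower | offset


def unpack_contiguous(address: int) -> dict:
    """Split a processor-contiguous address into its fields."""
    if (address >> STRIPE_BIT) & 1:
        raise ValueError("stripe bit is set; not a contiguous address")
    offset = address & OFFSET_MASK
    return {
        "io": (address >> 43) & 1,
        "pid": (((address >> 37) & 0x3F) << 2) | ((address >> 34) & 0b11),
        "offset": offset,
        "port": (offset >> 6) & 1,       # bit 6 selects the memory port
        "block_offset": offset & 0x3F,   # bits 5-0 within the cache block
    }


print(hex(pack_contiguous(4, 0)))
print(unpack_contiguous(pack_contiguous(200, 0x1234C5)))
```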

Referring now to FIG. 4b, the striped memory space also 
is partitioned with bit 43 identifying a memory cacheable 
address (with a logic "0") or an I/O access, and bits 42-37 
specifying the upper six bits of the processor ID (PID 7:2). 
Bit 36, which is the stripe bit, is set to "1" to indicate a 
striped memory address. Address space bits 35-9 and 6-0 
identify the memory offset within the selected processor, 
thereby supporting up to 16 GB of memory per processor. 
For striped addressing, address bits 8-7 identify the lower 
two bits of the processor ID (PID 1:0). In addition, in 
accordance with the preferred embodiment, bit 6 identifies 
the memory port of the processor. Bits 5-0 identify the low 
order address offset at that processor. If a larger memory size 
is desired, then bit 42 may be used as a high order offset bit 
instead of as a processor ID bit. In the preferred 
embodiment, striping is only allowed or disallowed in 
blocks or chunks of memory. Thus, according to the 
preferred embodiment, striping is determined with a 256 MB 
granularity in 16 GB per processor configurations. In 32 GB 
per processor configurations, striping can only be allowed or 
disallowed with a granularity of 512 MB. 
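
The striped layout of FIG. 4b can be sketched in the same way. The example assumes that the 34-bit memory offset is formed by concatenating address bits 35-9 (upper part) with bits 6-0 (lower part); that concatenation order is an interpretation of the text rather than something it states, and the function names are illustrative.

```python
STRIPE_BIT = 36
HIGH_MASK = (1 << 27) - 1  # address bits 35-9
LOW_MASK = 0x7F            # address bits 6-0 (port bit plus block offset)


def pack_striped(pid: int, offset: int) -> int:
    """Build a striped memory address (bit 43 = 0, bit 36 = 1)."""
    if not 0 <= pid < 256:
        raise ValueError("PID must fit in eight bits")
    if not 0 <= offset < (1 << 34):
        raise ValueError("offset must fit in 34 bits")
    return (
        ((pid >> 2) << 37)            # PID 7:2 in address bits 42-37
        | (1 << STRIPE_BIT)           # stripe bit
        | ((offset >> 7) << 9)        # upper offset in bits 35-9
        | ((pid & 0b11) << 7)         # PID 1:0 in address bits 8-7
        | (offset & LOW_MASK)         # lower offset in bits 6-0
    )


def unpack_striped(address: int) -> tuple[int, int]:
    """Return (pid, offset) for a striped address."""
    if not (address >> STRIPE_BIT) & 1:
        raise ValueError("stripe bit is clear; not a striped address")
    pid = (((address >> 37) & 0x3F) << 2) | ((address >> 7) & 0b11)
    offset = (((address >> 9) & HIGH_MASK) << 7) | (address & LOW_MASK)
    return pid, offset


print(hex(pack_striped(5, 0)))
print(unpack_striped(pack_striped(77, 0x2ABCDE)))
```

Because PID bits 1:0 sit just above the port bit, two addresses that differ only in bits 8-7 name the same offset on different processors of one stripe set.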

The present invention includes the capability to select 
between processor-contiguous addressing and striping based 
upon the type of data and instructions to be stored in 
memory. Some applications have instruction streams and 
data structures that are only used by a single processor. 
Storing in contiguous memory space reduces memory 
latency for that data when accessed by the local processor 
exclusively. Striped addressing, conversely, makes memory 
latency more uniform for data and instructions that are used 
by more than one processor, because it permits portions of 
the data to be retrieved from different processors. 

When processor contiguous addressing is used, data and 
instructions are stored within the memory of a single 
processor, in accordance with normal convention. In the 
preferred embodiment, the processor contiguous memory 
space ranges are pre-assigned based on the following Table 
I. 

TABLE I 

PROCESSOR #   PID       LOWER RANGE     UPPER RANGE 

0             0000000   000.0000.0000   003.FFFF.FFFF 
1             0000001   004.0000.0000   007.FFFF.FFFF 
2             0000010   008.0000.0000   00B.FFFF.FFFF 
3             0000011   00C.0000.0000   00F.FFFF.FFFF 
4             0000100   020.0000.0000   023.FFFF.FFFF 
5             0000101   024.0000.0000   027.FFFF.FFFF 
6             0000110   028.0000.0000   02B.FFFF.FFFF 
7             0000111   02C.0000.0000   02F.FFFF.FFFF 
...           ...       ...             ... 
127           1111111   3EC.0000.0000   3EF.FFFF.FFFF 





The striped memory space ranges preferably are pre- 
assigned based on the following Table II. 



TABLE II 

PROCESSOR #   PID               LOWER RANGE     UPPER RANGE 

0-3           0000000-0000011   010.0000.0000   01F.FFFF.FFFF 
4-7           0000100-0000111   030.0000.0000   03F.FFFF.FFFF 
8-11          0001000-0001011   050.0000.0000   05F.FFFF.FFFF 
12-15         0001100-0001111   070.0000.0000   07F.FFFF.FFFF 
16-19         0010000-0010011   090.0000.0000   09F.FFFF.FFFF 
20-23         0010100-0010111   0B0.0000.0000   0BF.FFFF.FFFF 
24-27         0011000-0011011   0D0.0000.0000   0DF.FFFF.FFFF 
28-31         0011100-0011111   0F0.0000.0000   0FF.FFFF.FFFF 
...           ...               ...             ... 
124-127       1111100-1111111   3F0.0000.0000   3FF.FFFF.FFFF 
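
The ranges in Tables I and II follow from the bit layouts of FIGS. 4a and 4b. The sketch below recomputes a few rows as a cross-check; the helper names are hypothetical, and the dotted address format simply mirrors the tables.

```python
def fmt(address: int) -> str:
    """Format a 44-bit address as in the tables, e.g. 003.FFFF.FFFF."""
    digits = f"{address:011X}"
    return f"{digits[:3]}.{digits[3:7]}.{digits[7:]}"


def contiguous_range(pid: int) -> tuple[str, str]:
    """Lower and upper contiguous addresses owned by one processor."""
    base = ((pid >> 2) << 37) | ((pid & 0b11) << 34)  # stripe bit 36 = 0
    return fmt(base), fmt(base | ((1 << 34) - 1))


def striped_range(first_pid: int) -> tuple[str, str]:
    """Lower and upper striped addresses shared by a four-processor set."""
    base = ((first_pid >> 2) << 37) | (1 << 36)       # stripe bit 36 = 1
    return fmt(base), fmt(base | ((1 << 36) - 1))


for pid in (0, 6, 127):
    print(pid, contiguous_range(pid))
for first in (0, 8, 124):
    print(f"{first}-{first + 3}", striped_range(first))
```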

Referring now to FIG. 3, a portion of an exemplary 
multiple processor computer system is shown with 16 
processors that are identified as 100a-100p. Each processor 
preferably includes two memory ports (port 0 and port 1) 
which connect to two memory banks 302, 304. According to 
one exemplary embodiment, one of the two ports may be 
configured with cache memory, while the other comprises 
standard DRAM memory. Also, as noted above, address 
space bit 6 identifies the port being addressed for a given 
processor. 

As shown in FIGS. 3 and 4b, the processors are split into 
local striped sets based upon processor identification 
numbers. Thus, the local striped processor set 325 comprises all 
processors that have the PID bits 000000xx. Put differently, the local 
striped set comprises all processors with the first six PID bits 
of 000000. As shown in FIG. 3, these six bits would be 
represented in address space bits 42-37. The last two PID 
bits, which are found in address space bits 8-7, identify 
which of the four processors is referenced. Thus, if the last 
two PID bits are 00, that indicates that processor 100a is 
targeted (for a complete PID of 00000000). Similarly, the 
last two PID bits of processor 100b are 01, while processors 
100c and 100d have the last two PID bits of 10 and 11. 

The second local striped processor set 350 are those 
processors 100e-100h with the first six PID bits of 000001 
in address space bits 42-37. The specific processor 
100e-100h is identified by the four possible bit 
combinations present in address space bits 7-8, which are the two 
least significant PID bits. To further continue the example, 
the third local striped processor set 375 are processors 
100i-100l with the first six PID bits of 000010 in address 
space bits 42-37. The specific processor 100i-100l is 
identified by the four possible bit combinations present in 
address space bits 7-8, which are the two least significant 
PID bits. 

Because the lower two PID bits are identified in the low 
order bits of the address space for striped addressing, 
incrementing of addresses causes the data to be striped 
between the four processors in the local striped processor 
set. Thus, as data or instructions are stored, the operating 
system will increment the value in the address space. If 
striped addressing is indicated, this process automatically 
causes the data to be stored in an interleaved fashion among 
the local four processor set. This then permits the data to be 
accessed in parallel from the four processors. 
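
The interleaving effect can be seen by stepping a striped address through consecutive 64-byte cache blocks and reading off the low PID bits (address bits 8-7) and the port bit (bit 6). The sketch below does that for the first stripe set; the function name and the chosen base address are illustrative.

```python
def striped_target(address: int) -> tuple[int, int]:
    """Return (PID bits 1:0, memory port) selected by a striped address."""
    return (address >> 7) & 0b11, (address >> 6) & 1


BLOCK = 64       # cache block size in bytes
base = 1 << 36   # stripe bit set, PID 7:2 = 000000 (first stripe set)

# Walk ten consecutive cache blocks and record where each one lands.
targets = [striped_target(base + i * BLOCK) for i in range(10)]
for i, (pid_low, port) in enumerate(targets):
    print(f"block {i}: processor {pid_low} of the set, port {port}")
```

Under this layout every 128 bytes moves to the next processor in the set, and the pattern wraps around after 512 bytes.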

It should be understood that the existence of local striped 
processor sets does not necessarily mean that these sets can only be 
referenced by striped addressing. As indicated above, in the 
preferred embodiment either striped or processor contiguous 
addressing may be used for all system memory. The deter- 
mination of whether to use striped addressing for a memory 
block preferably is based on the mask in the stripe control 
register 183 (FIG. 2b). 

Moreover, it should also be understood that the system is 
capable of translating between processor contiguous 
addressing and striped addressing, as required. In addition, 
a special addressing command may be used to transfer data 
requests between processors. Thus, in the preferred 
embodiment, a source processor identifies the memory block 
it needs, and sends a request in canonical form to the local 
processor that is acting as the memory controller for that 
data. If the data is stored in striped form (as indicated by the 
mask in the Cbox stripe control register), the local processor 
must translate the canonical form of the address to the 
striped address to retrieve the requested data. Moreover, the 
address may need to be relayed to another processor that has 
the data in its cache, in which case the address must be 
re-converted back to canonical form for sending to the other 
processor that has what may be a dirty copy of the data in 
its cache. 

The above discussion is meant to be illustrative of the 
principles and various embodiments of the present inven- 
tion. Numerous variations and modifications will become 
apparent to those skilled in the art once the above disclosure 
is fully appreciated. Thus, for example, although only two 
addressing schemes are disclosed in the preferred embodi- 
ment (which are striped and processor contiguous 
addressing), other addressing schemes may also be used 
simultaneously in addition to or as an alternative to these 
two addressing schemes. To implement additional addressing 
schemes, more than one bit would be used for the stripe bit. 
Thus, for example, if two bits were dedicated to the address 
type, four addressing schemes could be used simultaneously. 
In addition, the bit field positions and widths of the address- 
ing space could be varied for each addressing scheme 
without departing from the principles of the present inven- 
tion. It is intended that the following claims be interpreted 
to embrace all such variations and modifications. 

What is claimed is: 

1. A computer system, comprising: 

a plurality of processors that are coupled together; 

a memory associated with each of said plurality of 
processors, wherein each of said plurality of processors 
is capable of accessing the memory associated with any 
other processor; 

wherein, in accordance with a stripe bit in an address 
signal, data is stored in any of the memories associated 
with said plurality of processors on either a processor 
contiguous basis, or by striping across multiple proces- 
sors in a stripe set; and 

wherein said address signal includes a field when striping 
across processors, said field includes a first sub-field 
that specifies the processors that comprise a stripe set 
and a second sub-field that specifies the number of 
processors in the stripe set. 

2. The computer system of claim 1, wherein memory is 
accessed using address command signals whose coding 
differs depending on whether memory is stored on a 
processor contiguous basis or a striped basis. 

3. The computer system of claim 2, wherein the address 
command signal includes a processor identification bit field 
that identifies the target of a memory access and contains the 
first and second sub-fields. 

4. The computer system of claim 3, wherein the processor 
identification bit field includes n bits identifying 2^n 
processors, and a portion of said n bits is located in low order 
address space for a striped memory access. 

5. The computer system of claim 4, wherein said low 
order address space resides in the lower two bytes of address 
space. 

6. The computer system of claim 5, wherein said low 
order address space resides in the lowest byte of address 
space. 

7. The computer system as in claim 4, wherein said 
portion of said n bits comprises two bits to identify a four 
processor striped set across which data is striped. 

8. The computer system of claim 4, wherein said address 
command signal includes bits representing a memory offset, 
and said low order space resides in bits that are less 
significant than at least a portion of the memory offset bits. 

9. The computer system of claim 1, wherein at least one 
of said plurality of said processors includes a first and 
second memory port, and said memory associated with said 
at least one processor comprises a first memory sub-system 
and a second memory sub-system, which are respectively 
coupled to said first and second memory port. 

10. The computer system as in claim 9, wherein said first 
memory sub-system comprises DRAM memory, and said 
second memory sub-system comprises cache memory. 

11. The computer system as in claim 10, wherein said at 
least one of said plurality of said processors includes an 
associated memory controller for each memory port. 

12. The computer system as in claim 1, wherein said 
plurality of said processors include a memory controller that 
interfaces said processor to said associated memory. 

13. The computer system of claim 1, wherein said plu- 
rality of said processors include a stripe control register that 
includes a mask for identifying which memory blocks are to 
be accessed with striped addressing and which memory 
blocks are to be accessed with processor contiguous address- 
ing. 

14. The computer system of claim 13, wherein said stripe 
control register comprises an internal processor register. 

15. The computer system of claim 4, wherein the 
processor identification bit field includes n bits identifying 2^n 
processors, and said n bits are located in high order address 
space for a processor contiguous address. 

16. The computer system of claim 1, where data includes 
instructions. 

17. The computer system of claim 1, wherein said 
processors are grouped into local stripe sets based upon the 
lowermost bits in a processor identification bit field. 

18. The computer system of claim 17, wherein said local 
stripe sets includes four processors that are determined by 
the two lowermost bits in said processor identification field. 
19. The computer system of claim 1, wherein memory 
accesses using striped addressing and memory accesses 
using processor contiguous addressing occur simultaneously 
in said computer system. 

20. A computer system, comprising: 

a plurality of processors that are coupled together; 

a memory associated with each of said plurality of
processors, wherein each of said plurality of processors 
is capable of accessing the memory associated with any 
other processor on either a processor contiguous basis,
or on a stripe basis across multiple processors in a stripe 
set; and 

an I/O controller, associated with each of said plurality of 
processors, capable of interfacing with I/O devices, 
wherein each of said plurality of processors is capable 
of accessing I/O devices associated with any other 
processor; 

wherein, in accordance with a stripe bit in an address 
command signal, data is stored in any of the memories 
associated with said plurality of processors on either a 
processor contiguous basis, or by striping across mul- 
tiple processors in a stripe set; and 

wherein said address command signal includes a field 
when striping across processors, said field includes a 
first sub-field that specifies the processors that comprise 
a stripe set and a second sub-field that specifies the 
number of processors in the stripe set. 

21. The computer system of claim 20, wherein memory 
accesses include an address command signal that differs 
depending on whether memory is accessed on a processor 
contiguous basis or a striped basis, and the address com- 
mand signal includes a processor identification bit field that 
identifies the target of a memory access and which includes 
said first and second sub-fields. 

22. The computer system of claim 21, wherein the pro- 
cessor identification bit field for a stripe memory access 
includes n bits identifying 2^n processors, and wherein said
first sub-field is located in high order address space, and said 
second sub-field is located in low order address space. 

23. A computer system, comprising: 

a plurality of processors that are coupled together; 

a memory associated with each of said plurality of 
processors, wherein each of said plurality of processors
is capable of accessing the memory associated with any 
other processor on either a processor contiguous basis, 
or on a stripe basis across multiple processors in a stripe 
set; and 

an I/O controller, associated with each of said plurality of 
processors, capable of interfacing with I/O devices, 
wherein each of said plurality of processors is capable 
of accessing I/O devices associated with any other 
processor; 

wherein memory accesses include an address command 
signal that differs depending on whether memory is 
accessed on a processor contiguous basis or a striped 
basis, and the address command signal includes a 
processor identification bit field that identifies the target 
of a memory access; 

wherein the address command signal includes a stripe bit 
that indicates if the address command signal is a striped 
memory access or a processor contiguous memory 
access; 

wherein the processor identification bit field for a stripe 
memory access includes n bits identifying 2^n
processors, and said processor identification bit field 
includes a first portion y and a second portion x, and 
wherein said first portion y is located in high order 
address space, and said second portion x is located in 
low order address space; and 

wherein the first portion y includes a bit field that defines 
the processors that comprise the stripe set, and the 
second portion x includes a bit field that defines the 
number of processors in said stripe set. 

24. The computer system of claim 23, wherein said low 
order address space resides in the lowest byte of address 
space and said high order space resides in the highest byte
of address space. 

25. The computer system as in claim 24, wherein said
processor identification bit field comprises at least seven
bits, and the second portion x comprises two bits identifying 
four processors within each stripe set. 
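
The split processor identification field of claims 23 to 25 can be illustrated with a small decoding sketch. The claims fix only that portion y sits in high order address space, that portion x sits in low order address space, and that x comprises two bits of a field of at least seven bits. The 32-bit address width, the position of x at bits 6 and 7 (within the lowest byte), and the function name `target_processor` are assumptions chosen for this example. The stripe bit of the address command signal is modelled here as a separate boolean argument.

```python
# Hypothetical address layout; all widths and shifts below are assumptions.
ADDR_BITS = 32                    # assumed address width
PID_BITS = 7                      # processor identification field (seven bits)
X_BITS = 2                        # portion x: selects one of four processors
Y_BITS = PID_BITS - X_BITS        # portion y: selects the stripe set
X_SHIFT = 6                       # assumed: x occupies bits 6-7 of the lowest byte
Y_SHIFT = ADDR_BITS - Y_BITS      # y occupies the highest address bits
PID_SHIFT = ADDR_BITS - PID_BITS  # contiguous mode: whole field in high order bits


def target_processor(addr: int, striped: bool) -> int:
    """Return the ID of the processor whose memory holds addr."""
    if striped:
        y = (addr >> Y_SHIFT) & ((1 << Y_BITS) - 1)
        x = (addr >> X_SHIFT) & ((1 << X_BITS) - 1)
        return (y << X_BITS) | x
    return (addr >> PID_SHIFT) & ((1 << PID_BITS) - 1)


# Five consecutive 64-byte blocks starting at the base of stripe set 3.
base = 3 << Y_SHIFT
striped_targets = [target_processor(base + k * 64, True) for k in range(5)]
contiguous_targets = [target_processor(base + k * 64, False) for k in range(5)]
```

In this layout, consecutive 64-byte blocks accessed with striped addressing rotate through processors 12, 13, 14, and 15 before wrapping back to 12, while the same addresses accessed on a processor contiguous basis all resolve to processor 12.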

26. The computer system of claim 21, wherein the address 
command signal includes a bit that identifies if an access 
targets memory or an I/O controller. 

27. The computer system as in claim 20, wherein said 
plurality of said processors include a memory controller that 
interfaces said processor to said associated memory.

28. The computer system of claim 27, wherein said
memory controller includes a stripe control register that 
includes a mask for identifying which memory blocks in 
said associated memory are to be accessed with striped 
addressing and which memory blocks in said associated 
memory are to be accessed with processor contiguous
addressing. 

29. The computer system of claim 20, wherein memory 
accesses using striped addressing and memory accesses 
using processor contiguous addressing occur simultaneously 
in said computer system to different memory blocks. 